This vignette describes basic usage of the package intended to process several large bedgraph files in R. Methrix provides set of function which allows easy importing of various flavors of bedgraphs generated by methylation callers, and many downstream analysis to be performed on large matrices.
As a first step, we need a list of CpG sites in the respective genome. The CpG sites are listed using the respective Bsgenome annotation package. The read_bedgraph function is also able to extract CpG sites on it’s own, however, it might be beneficial to do it separately.
## -Extracting CpGs
## -Done. Extracted 28,700,086 CpGs from 93 contigs.
An annotation table is also necessary to perform analyses. The data will be added to the methrix object, as a colData slot.
#Example bedgraph files
bdg_files <- list.files(
path = system.file('extdata', package = 'methrix'),
pattern = "*bedGraph\\.gz$",
full.names = TRUE
)
sample_anno <- data.frame(
row.names = gsub(
pattern = "\\.bedGraph\\.gz$",
replacement = "",
x = basename(bdg_files)
),
Condition = c("cancer", 'cancer', "normal", "normal"),
Pair = c("pair1", "pair2", "pair1", "pair2"),
stringsAsFactors = FALSE
)
knitr::kable(sample_anno)| Condition | Pair | |
|---|---|---|
| C1 | cancer | pair1 |
| C2 | cancer | pair2 |
| N1 | normal | pair1 |
| N2 | normal | pair2 |
read_bedgraphs function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import bedgraphs from most common programs such as Bismark (.cov format), MethylDackel, and MethylcTools. In this case, there is no need to define e.g. chr_idx start_idx arguments, the function will automatically assign them. For the sake of fast analysis, here we will use example dataset from methrix.
## [1] "C1.bedGraph.gz" "C2.bedGraph.gz" "N1.bedGraph.gz" "N2.bedGraph.gz"
The read_bedgraphs function adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class methrix which in turn inherits SummarizedExperiment class. methrix object contains ‘methylation’ and ‘coverage’ matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis. For further details on the data structure, see the SummarizedExperiment package.
For computers with large memory, increase speed using vect=TRUE. The function will open multiple files, defined by the vect_batch_size argument. Keep in mind, that increasing the batch size will increase the memory need as well.
If vect=FALSE, the function will iterate along the list of files, keeping only one open at a time.
meth <- methrix::read_bedgraphs(
files = bdg_files,
ref_cpgs = hg19_cpgs,
chr_idx = 1,
start_idx = 2,
M_idx = 3,
U_idx = 4,
stranded = FALSE,
zero_based = FALSE,
collapse_strands = FALSE,
coldata = sample_anno,
vect = TRUE,
vect_batch_size = 4
)## ----------------------------
## -Preset: Custom
## --Missing beta and coverage info. Estimating them from M and U values
## -CpGs raw: 28,700,086
## -CpGs filtered: 28,217,448
## ----------------------------
## -Batch: 1/1
## -Processing: C1.bedGraph.gz
## --CpGs missing: 28,216,771
## -Processing: C2.bedGraph.gz
## --CpGs missing: 28,216,759
## -Processing: N1.bedGraph.gz
## --CpGs missing: 28,216,746
## -Processing: N2.bedGraph.gz
## --CpGs missing: 28,216,747
## -Finished in: 52.4s elapsed (00:01:04 cpu)
Potential pitfalls: Too many missing CpGs, if not expected, might indicate that something went wrong.
An obvious error is using the wrong reference genome.
A common mistake is falsely indicating strandedness and zero or one based positioning. The combination of the two might lead to only using C-s from one strand (due to a 1 bp shift), resulting lower coverage for the sites. A nice description of the zero- and one-based coordination system can be found here
Without any further processing, we can create an interactive html report containing basic summary statistics of the methrix object with methrix_report function.
In the report, we can check genome-wide and chromosome based statistics on coverage and methylation, in order to identify potential quality issues, for example:
samples with too high or too low genome-wide coverage or on a specific chromosome. In this case, there is also a possibility, that the sample is affected by copy number alterations. Both methylation levels and coverage therefore shows higher variation among cancer samples.
samples with altered beta level distribution.
A “bump” on the beta level density plot at 0.5 indicates the presence of single nucleotide polymorphisms (SNPs) -> see later.
To ensure the high quality of our dataset, the sites with very low (untrustworthy methylation level) or high coverage (technical problem) should not be used. We can mask these CpG sites. Please note, that the DSS we were using for Differential Methylation Region (DMR) calling, doesn’t need the removal of the lowly covered sites, because it takes it into account by analysis. However, using a not too restrictive threshold won’t interfere with DMR calling.
## -Masked 318 CpGs due to low coverage.
## -Masked 1 CpGs due to too high coverage in sample C1.
## -Masked 5 CpGs due to too high coverage in sample C2.
## -Masked 5 CpGs due to too high coverage in sample N1.
## -Masked 5 CpGs due to too high coverage in sample N2.
## -Finished in: 6.690s elapsed (4.550s cpu)
Later we can remove those sites that are not covered by any of the samples.
## -Removed 28,216,735 [100%] uncovered loci of 28,217,448 sites
## -Finished in: 14.0s elapsed (18.6s cpu)
We might want to remove the sites that are sparsely covered. If we plan to do differential analysis, we can do it group-wise, or using the whole population.
## -Retained 430 of 713 sites
## -Finished in: 0.920s elapsed (0.610s cpu)
SNPs, mostly the C > T mutations can disrupt methylation calling. Therefore it is essential to remove CpG sites that overlap with common variants, if we have variation in our study population. For example working with human, unpaired data, e.g. treated vs. untreated groups. During filtering, we can select the population of interest and the minimum minor allele frequency (MAF) of the SNPs. SNP filtering is currently implemented for hg19 and hg38 assemblies.
if(!requireNamespace("GenomicScores")) {
BiocManager::install("GenomicScores")
}
library(GenomicScores) ## Used SNP database: MafDb.1Kgenomes.phase3.hs37d5.
## Number of SNPs removed:
## chr N
## 1: chr21 23
## 2: chr22 30
## Sum:
## [1] 53
if (!requireNamespace("pheatmap", quietly = TRUE))
install.packages("pheatmap")
snps <- meth[[2]]
meth <- meth[[1]]
pheatmap::pheatmap(get_matrix(order_by_sd(snps)[1:min(5000, length(snps)),]))region filter function.After filtering, it worth running a report again, to see if any of the samples were so lowly covered that it warrants an action (e.g. removal of the sample)
Important: Don’t forget to set the recal_stats to TRUE, since the object changed since reading in. The output directory has to be different from last time, to avoid using the intermediate files calculated during the last run.
methrix::methrix_report(meth = meth, output_dir = tempdir(), recal_stats = TRUE, prefix="processed")There are additional possibilities to visualize the study. We can look at the density of coverage both sample- and group-wise:
We can visualize te beta value distribution as a violin plot or density plot. These plots (as well as the coverage plot) use the most variable 25000 CpG sites to ensure fast processing. plot_density and plot_violin has the option to use data only from a restricted region.
Methrix offers principal component analysis (PCA) to conduct and visualize dimensionality reduction. As a first step, the PCA model has to be calculated. As for plotting, a given number of sites are selected, either random or based on variance (var option). This ensures that the calculations remain feasible.
At visualization, we can provide the methrix object to allow color or shape annotation of groups or samples.
methrix offers the possibility of region based filtering. With this option, selected regions, e.g. promoters can be visualized. We will use the annotatr package to assess basic annotation categories.
library(annotatr)
promoters = build_annotations(genome = 'hg19', annotations = "hg19_genes_promoters")
promoters## GRanges object with 82960 ranges and 5 metadata columns:
## seqnames ranges strand | id tx_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 10874-11873 + | promoter:1 uc001aaa.3
## [2] chr1 10874-11873 + | promoter:2 uc010nxq.1
## [3] chr1 10874-11873 + | promoter:3 uc010nxr.1
## [4] chr1 68091-69090 + | promoter:4 uc001aal.1
## [5] chr1 320084-321083 + | promoter:5 uc001aaq.2
## ... ... ... ... . ... ...
## [82956] chrUn_gl000237 2687-3686 - | promoter:82956 uc011mgu.1
## [82957] chrUn_gl000241 36876-37875 - | promoter:82957 uc011mgv.2
## [82958] chrUn_gl000243 10501-11500 + | promoter:82958 uc011mgw.1
## [82959] chrUn_gl000243 12608-13607 + | promoter:82959 uc022brq.1
## [82960] chrUn_gl000247 5817-6816 - | promoter:82960 uc022brr.1
## gene_id symbol type
## <character> <character> <character>
## [1] 100287102 DDX11L1 hg19_genes_promoters
## [2] 100287102 DDX11L1 hg19_genes_promoters
## [3] 100287102 DDX11L1 hg19_genes_promoters
## [4] 79501 OR4F5 hg19_genes_promoters
## [5] <NA> <NA> hg19_genes_promoters
## ... ... ... ...
## [82956] <NA> <NA> hg19_genes_promoters
## [82957] <NA> <NA> hg19_genes_promoters
## [82958] <NA> <NA> hg19_genes_promoters
## [82959] <NA> <NA> hg19_genes_promoters
## [82960] <NA> <NA> hg19_genes_promoters
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
mpca <- methrix::methrix_pca(m= meth, top_var = 10000, ranges = promoters, n_pc = 3, do_plot = F)
methrix::plot_pca(mpca, m=meth, col_anno = "Condition", shape_anno = "Pair")A very important step of a WGBS data analysis is the differential methylation calling. During this step, we would like to identify changes between groups or conditions first on the site level. Then, we will need to identify larger regions (DMR-s, differentially methylated regions) affected by methylation changes that are more likely to represent functional alterations. Methrix doesn’t have a DMR caller, therefore we will use DSS. At this stage, DSS is not yet able to work directly with methrix, therefore we transform our methrix object to bsseq for the sake of DMR calling.
For more detailed description, options and recommendations on smoothing, please refer to the ´DSS´ vignette
bs_obj <- methrix::methrix2bsseq(meth)
dmlTest = DSS::DMLtest(bs_obj, group1=c("C1", "C2"), group2=c("N1", "N2"),
smoothing = TRUE)DMRs:
kable(dmrs) %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
scroll_box(width = "100%", height = "200px")| chr | start | end | length | nCG | meanMethy1 | meanMethy2 | diff.Methy | areaStat | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | chr22 | 49004740 | 49006235 | 1496 | 31 | 0.5781963 | 0.8942923 | -0.3160959 | -107.20207 |
| 5 | chr21 | 47518792 | 47519087 | 296 | 16 | 0.6997633 | 0.1373555 | 0.5624078 | 98.88005 |
| 8 | chr21 | 47537796 | 47538662 | 867 | 16 | 0.4827543 | 0.8208406 | -0.3380862 | -45.88851 |
| 7 | chr21 | 47536948 | 47537698 | 751 | 16 | 0.4975329 | 0.7797307 | -0.2821978 | -42.37607 |
| 11 | chr22 | 45085101 | 45085453 | 353 | 10 | 0.4298283 | 0.8259031 | -0.3960749 | -40.41343 |
| 4 | chr21 | 47286319 | 47287013 | 695 | 12 | 0.5860615 | 0.8665249 | -0.2804633 | -34.55679 |
| 1 | chr21 | 27867791 | 27868434 | 644 | 9 | 0.4111249 | 0.7517059 | -0.3405811 | -29.67288 |
| 12 | chr22 | 45086168 | 45086541 | 374 | 7 | 0.5305423 | 0.7868741 | -0.2563318 | -18.98492 |
| 13 | chr22 | 49004424 | 49004642 | 219 | 7 | 0.6718556 | 0.8557647 | -0.1839091 | -14.43668 |
| 6 | chr21 | 47520470 | 47520710 | 241 | 5 | 0.3155754 | 0.6228646 | -0.3072891 | -13.86936 |
| 2 | chr21 | 27869149 | 27869648 | 500 | 4 | 0.4185388 | 0.8134162 | -0.3948774 | -12.97328 |
| 17 | chr22 | 49007313 | 49007398 | 86 | 5 | 0.7567327 | 0.9309532 | -0.1742205 | -10.84714 |
| 10 | chr21 | 47540244 | 47540455 | 212 | 5 | 0.5135043 | 0.7357068 | -0.2222025 | -10.61340 |
DMLs:
kable(dmls) %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
scroll_box(width = "100%", height = "200px")| chr | pos | mu1 | mu2 | diff | diff.se | stat | phi1 | phi2 | pval | fdr | postprob.overThreshold | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 168 | chr21 | 47518998 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0853889 | 6.579540 | 0.0610930 | 0.0446073 | 0.0000000 | 0.0000000 | 1.0000000 |
| 155 | chr21 | 47518792 | 0.7054168 | 0.1202545 | 0.5851623 | 0.0913289 | 6.407197 | 0.0514057 | 0.0591479 | 0.0000000 | 0.0000000 | 0.9999999 |
| 156 | chr21 | 47518800 | 0.7054168 | 0.1202545 | 0.5851623 | 0.0919110 | 6.366621 | 0.0431757 | 0.0628077 | 0.0000000 | 0.0000000 | 0.9999999 |
| 157 | chr21 | 47518812 | 0.6916001 | 0.1352522 | 0.5563479 | 0.0890065 | 6.250641 | 0.0479340 | 0.0607602 | 0.0000000 | 0.0000000 | 0.9999999 |
| 163 | chr21 | 47518896 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0901120 | 6.234678 | 0.0486725 | 0.0592783 | 0.0000000 | 0.0000000 | 0.9999999 |
| 161 | chr21 | 47518867 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0901704 | 6.230640 | 0.0437346 | 0.0430716 | 0.0000000 | 0.0000000 | 0.9999998 |
| 167 | chr21 | 47518935 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0902003 | 6.228576 | 0.1511329 | 0.0570187 | 0.0000000 | 0.0000000 | 0.9999998 |
| 164 | chr21 | 47518917 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0906481 | 6.197809 | 0.0426914 | 0.0494803 | 0.0000000 | 0.0000000 | 0.9999998 |
| 162 | chr21 | 47518877 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0908797 | 6.182011 | 0.0428864 | 0.0440936 | 0.0000000 | 0.0000000 | 0.9999998 |
| 160 | chr21 | 47518851 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0911002 | 6.167053 | 0.0626699 | 0.0640963 | 0.0000000 | 0.0000000 | 0.9999998 |
| 170 | chr21 | 47519087 | 0.7024413 | 0.1588468 | 0.5435945 | 0.0882510 | 6.159642 | 0.0610382 | 0.0503966 | 0.0000000 | 0.0000000 | 0.9999998 |
| 159 | chr21 | 47518846 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0924657 | 6.075977 | 0.0816615 | 0.0626637 | 0.0000000 | 0.0000000 | 0.9999997 |
| 166 | chr21 | 47518926 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0931339 | 6.032382 | 0.0426382 | 0.0509624 | 0.0000000 | 0.0000000 | 0.9999996 |
| 165 | chr21 | 47518924 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0939471 | 5.980171 | 0.0405490 | 0.0526171 | 0.0000000 | 0.0000001 | 0.9999996 |
| 158 | chr21 | 47518842 | 0.6992837 | 0.1374641 | 0.5618196 | 0.0942631 | 5.960121 | 0.0471687 | 0.0643234 | 0.0000000 | 0.0000001 | 0.9999995 |
| 169 | chr21 | 47519054 | 0.6992169 | 0.1509747 | 0.5482422 | 0.0940867 | 5.826988 | 0.0449055 | 0.0479805 | 0.0000000 | 0.0000001 | 0.9999991 |
| 335 | chr22 | 49005349 | 0.4514889 | 0.9044401 | -0.4529512 | 0.0851709 | -5.318149 | 0.1087880 | 0.0362448 | 0.0000001 | 0.0000023 | 0.9999829 |
| 337 | chr22 | 49005393 | 0.4382166 | 0.8942026 | -0.4559860 | 0.0888252 | -5.133523 | 0.0530798 | 0.0513445 | 0.0000003 | 0.0000059 | 0.9999693 |
| 338 | chr22 | 49005405 | 0.4303463 | 0.8911297 | -0.4607835 | 0.0901308 | -5.112387 | 0.0510360 | 0.0441204 | 0.0000003 | 0.0000062 | 0.9999687 |
| 336 | chr22 | 49005371 | 0.4514889 | 0.9044401 | -0.4529512 | 0.0896909 | -5.050135 | 0.0607954 | 0.0643722 | 0.0000004 | 0.0000082 | 0.9999584 |
| 340 | chr22 | 49005581 | 0.4563350 | 0.8954013 | -0.4390663 | 0.0889273 | -4.937363 | 0.0498288 | 0.0539334 | 0.0000008 | 0.0000141 | 0.9999313 |
| 339 | chr22 | 49005551 | 0.4563350 | 0.8954013 | -0.4390663 | 0.0896094 | -4.899778 | 0.0447945 | 0.0412302 | 0.0000010 | 0.0000163 | 0.9999228 |
| 275 | chr22 | 45085101 | 0.3625507 | 0.8274930 | -0.4649423 | 0.0981511 | -4.737004 | 0.0449663 | 0.0552770 | 0.0000022 | 0.0000352 | 0.9998997 |
| 19 | chr21 | 27869149 | 0.0448313 | 0.6676962 | -0.6228649 | 0.1408678 | -4.421627 | 0.0529055 | 0.0379350 | 0.0000098 | 0.0001142 | 0.9998972 |
| 334 | chr22 | 49005299 | 0.4493985 | 0.8826734 | -0.4332749 | 0.0917291 | -4.723417 | 0.0414075 | 0.0545687 | 0.0000023 | 0.0000360 | 0.9998601 |
| 276 | chr22 | 45085113 | 0.3581053 | 0.8077136 | -0.4496084 | 0.0964650 | -4.660842 | 0.0667119 | 0.0914574 | 0.0000031 | 0.0000452 | 0.9998551 |
| 15 | chr21 | 27868252 | 0.3731671 | 0.7994402 | -0.4262731 | 0.0908086 | -4.694194 | 0.0484659 | 0.0609671 | 0.0000027 | 0.0000399 | 0.9998365 |
| 279 | chr22 | 45085158 | 0.3581053 | 0.8077136 | -0.4496084 | 0.0985216 | -4.563553 | 0.0471944 | 0.0513928 | 0.0000050 | 0.0000695 | 0.9998063 |
| 278 | chr22 | 45085153 | 0.3581053 | 0.8077136 | -0.4496084 | 0.1006989 | -4.464879 | 0.0424381 | 0.1077930 | 0.0000080 | 0.0001017 | 0.9997416 |
| 277 | chr22 | 45085130 | 0.3581053 | 0.8077136 | -0.4496084 | 0.1011552 | -4.444737 | 0.1017411 | 0.0866417 | 0.0000088 | 0.0001059 | 0.9997261 |
| 333 | chr22 | 49005192 | 0.4634613 | 0.8820300 | -0.4185687 | 0.0938413 | -4.460388 | 0.0540585 | 0.0785553 | 0.0000082 | 0.0001017 | 0.9996566 |
| 280 | chr22 | 45085308 | 0.4505152 | 0.8320146 | -0.3814995 | 0.0853803 | -4.468239 | 0.0566719 | 0.0613755 | 0.0000079 | 0.0001017 | 0.9995114 |
| 344 | chr22 | 49005951 | 0.4923568 | 0.8833153 | -0.3909585 | 0.0894254 | -4.371894 | 0.0434765 | 0.0414754 | 0.0000123 | 0.0001392 | 0.9994303 |
| 282 | chr22 | 45085355 | 0.4557529 | 0.8537278 | -0.3979749 | 0.0976251 | -4.076564 | 0.0589676 | 0.0787979 | 0.0000457 | 0.0005014 | 0.9988645 |
| 332 | chr22 | 49005145 | 0.5006768 | 0.8829753 | -0.3822986 | 0.0946697 | -4.038236 | 0.1013749 | 0.0462659 | 0.0000539 | 0.0005739 | 0.9985680 |
| 14 | chr21 | 27868217 | 0.3824057 | 0.7921241 | -0.4097185 | 0.1051999 | -3.894665 | 0.1422973 | 0.0704692 | 0.0000983 | 0.0009913 | 0.9983811 |
| 281 | chr22 | 45085330 | 0.4505152 | 0.8320146 | -0.3814995 | 0.0970732 | -3.930019 | 0.0406919 | 0.0453350 | 0.0000849 | 0.0008801 | 0.9981338 |
| 13 | chr21 | 27868022 | 0.3188341 | 0.7096548 | -0.3908207 | 0.1064360 | -3.671886 | 0.0393516 | 0.2277589 | 0.0002408 | 0.0022452 | 0.9968578 |
| 331 | chr22 | 49005139 | 0.5171680 | 0.8806450 | -0.3634770 | 0.0974334 | -3.730517 | 0.0710156 | 0.0676265 | 0.0001911 | 0.0018276 | 0.9965773 |
| 11 | chr21 | 27867985 | 0.3030118 | 0.6966354 | -0.3936236 | 0.1101373 | -3.573936 | 0.0469781 | 0.0423628 | 0.0003517 | 0.0031853 | 0.9961655 |
| 215 | chr21 | 47537470 | 0.5037846 | 0.8382309 | -0.3344463 | 0.0889507 | -3.759906 | 0.0528036 | 0.0588836 | 0.0001700 | 0.0016685 | 0.9958022 |
| 231 | chr21 | 47538444 | 0.4052968 | 0.8266158 | -0.4213191 | 0.1221300 | -3.449759 | 0.0997590 | 0.0563437 | 0.0005611 | 0.0043601 | 0.9957526 |
| 229 | chr21 | 47538409 | 0.3963148 | 0.8139883 | -0.4176736 | 0.1208318 | -3.456653 | 0.1688207 | 0.0444264 | 0.0005469 | 0.0043405 | 0.9957281 |
| 230 | chr21 | 47538440 | 0.4052968 | 0.8266158 | -0.4213191 | 0.1228531 | -3.429453 | 0.0535200 | 0.0440788 | 0.0006048 | 0.0046039 | 0.9955558 |
| 24 | chr21 | 47284793 | 0.4628484 | 0.8460420 | -0.3831936 | 0.1082997 | -3.538270 | 0.0655654 | 0.0541894 | 0.0004028 | 0.0034143 | 0.9955415 |
| 25 | chr21 | 47284838 | 0.4628484 | 0.8460420 | -0.3831936 | 0.1089589 | -3.516863 | 0.0610756 | 0.0536466 | 0.0004367 | 0.0036196 | 0.9953310 |
| 12 | chr21 | 27868007 | 0.3188341 | 0.7096548 | -0.3908207 | 0.1148248 | -3.403625 | 0.0364220 | 0.0762724 | 0.0006650 | 0.0047700 | 0.9943507 |
| 23 | chr21 | 47284703 | 0.4628484 | 0.8460420 | -0.3831936 | 0.1121245 | -3.417571 | 0.0700882 | 0.0627163 | 0.0006318 | 0.0047134 | 0.9942349 |
| 227 | chr21 | 47538333 | 0.3809894 | 0.8009985 | -0.4200091 | 0.1268170 | -3.311932 | 0.0604774 | 0.0780869 | 0.0009265 | 0.0061714 | 0.9942092 |
| 57 | chr21 | 47286743 | 0.4991447 | 0.8612133 | -0.3620685 | 0.1063704 | -3.403845 | 0.0631799 | 0.0558895 | 0.0006644 | 0.0047700 | 0.9931321 |
| 345 | chr22 | 49005995 | 0.5987661 | 0.9083944 | -0.3096283 | 0.0871067 | -3.554586 | 0.0425805 | 0.0454939 | 0.0003786 | 0.0032839 | 0.9919497 |
| 233 | chr21 | 47538549 | 0.4324684 | 0.8125794 | -0.3801110 | 0.1181722 | -3.216586 | 0.3041920 | 0.0955143 | 0.0012973 | 0.0080646 | 0.9911389 |
| 228 | chr21 | 47538362 | 0.4161624 | 0.8146878 | -0.3985254 | 0.1261927 | -3.158070 | 0.1748563 | 0.0744497 | 0.0015882 | 0.0092561 | 0.9910393 |
| 9 | chr21 | 27867791 | 0.4843264 | 0.7809081 | -0.2965817 | 0.0831049 | -3.568764 | 0.0592143 | 0.5521032 | 0.0003587 | 0.0031853 | 0.9909972 |
| 346 | chr22 | 49006004 | 0.5987661 | 0.9083944 | -0.3096283 | 0.0890551 | -3.476816 | 0.0851953 | 0.0467683 | 0.0005074 | 0.0041144 | 0.9907137 |
| 58 | chr21 | 47286758 | 0.4991447 | 0.8612133 | -0.3620685 | 0.1116282 | -3.243523 | 0.0838537 | 0.0521010 | 0.0011806 | 0.0077258 | 0.9905723 |
| 232 | chr21 | 47538540 | 0.4324684 | 0.8125794 | -0.3801110 | 0.1193858 | -3.183887 | 0.2393464 | 0.0505929 | 0.0014531 | 0.0087421 | 0.9905476 |
| 199 | chr21 | 47520710 | 0.2584592 | 0.5953496 | -0.3368904 | 0.1009593 | -3.336894 | 0.0455289 | 0.0717223 | 0.0008472 | 0.0057456 | 0.9905295 |
| 235 | chr21 | 47538572 | 0.4324684 | 0.8125794 | -0.3801110 | 0.1198739 | -3.170923 | 0.3018380 | 0.0414823 | 0.0015196 | 0.0089967 | 0.9903039 |
| 330 | chr22 | 49005081 | 0.5468489 | 0.8680866 | -0.3212377 | 0.0953684 | -3.368387 | 0.2110341 | 0.0579455 | 0.0007561 | 0.0053212 | 0.9898297 |
| 343 | chr22 | 49005781 | 0.5229578 | 0.8432075 | -0.3202497 | 0.0956307 | -3.348816 | 0.0721138 | 0.0478057 | 0.0008116 | 0.0056059 | 0.9893697 |
| 234 | chr21 | 47538562 | 0.4324684 | 0.8125794 | -0.3801110 | 0.1219200 | -3.117707 | 0.1623588 | 0.0558222 | 0.0018226 | 0.0101469 | 0.9892459 |
| 59 | chr21 | 47286763 | 0.5460588 | 0.8826394 | -0.3365807 | 0.1043409 | -3.225778 | 0.0570088 | 0.0527843 | 0.0012563 | 0.0079424 | 0.9883308 |
| 214 | chr21 | 47537365 | 0.5318998 | 0.8468988 | -0.3149990 | 0.0983314 | -3.203441 | 0.0440748 | 0.0741262 | 0.0013580 | 0.0083036 | 0.9856217 |
| 62 | chr21 | 47286790 | 0.6103130 | 0.9093437 | -0.2990307 | 0.0923822 | -3.236888 | 0.1929197 | 0.3636197 | 0.0012084 | 0.0077713 | 0.9844045 |
| 236 | chr21 | 47538610 | 0.4932530 | 0.8423031 | -0.3490501 | 0.1170936 | -2.980950 | 0.0904863 | 0.0534684 | 0.0028736 | 0.0144843 | 0.9833499 |
| 217 | chr21 | 47537522 | 0.5333436 | 0.8461447 | -0.3128012 | 0.1000745 | -3.125683 | 0.0915707 | 0.0748481 | 0.0017739 | 0.0100254 | 0.9832847 |
| 329 | chr22 | 49005011 | 0.5737763 | 0.8802274 | -0.3064512 | 0.0977632 | -3.134628 | 0.0621902 | 0.0378817 | 0.0017207 | 0.0098743 | 0.9826621 |
| 20 | chr21 | 27869506 | 0.5431080 | 0.8619896 | -0.3188816 | 0.1043744 | -3.055170 | 0.1409278 | 0.0467053 | 0.0022493 | 0.0118267 | 0.9820367 |
| 60 | chr21 | 47286771 | 0.5862985 | 0.8980315 | -0.3117331 | 0.1010982 | -3.083468 | 0.0392907 | 0.0534054 | 0.0020460 | 0.0110604 | 0.9819080 |
| 216 | chr21 | 47537517 | 0.5333436 | 0.8461447 | -0.3128012 | 0.1023925 | -3.054922 | 0.0647327 | 0.0539652 | 0.0022512 | 0.0118267 | 0.9811863 |
| 198 | chr21 | 47520632 | 0.2584592 | 0.5953496 | -0.3368904 | 0.1142489 | -2.948741 | 0.0688145 | 0.0501405 | 0.0031907 | 0.0154563 | 0.9810009 |
| 61 | chr21 | 47286783 | 0.6103130 | 0.9093437 | -0.2990307 | 0.0968681 | -3.086988 | 0.0415506 | 0.0529952 | 0.0020220 | 0.0110604 | 0.9800629 |
| 22 | chr21 | 27869648 | 0.5431080 | 0.8619896 | -0.3188816 | 0.1105168 | -2.885368 | 0.0810954 | 0.0385831 | 0.0039096 | 0.0184591 | 0.9762531 |
| 341 | chr22 | 49005673 | 0.5557180 | 0.8783041 | -0.3225861 | 0.1130008 | -2.854724 | 0.1850322 | 0.0487802 | 0.0043074 | 0.0200834 | 0.9756600 |
| 212 | chr21 | 47537252 | 0.4309434 | 0.7627531 | -0.3318097 | 0.1187513 | -2.794156 | 0.0875586 | 0.0456337 | 0.0052035 | 0.0235306 | 0.9746728 |
| 342 | chr22 | 49005702 | 0.5387623 | 0.8470871 | -0.3083247 | 0.1084777 | -2.842287 | 0.1666430 | 0.0527486 | 0.0044791 | 0.0206261 | 0.9726823 |
| 63 | chr21 | 47286854 | 0.6285885 | 0.8950792 | -0.2664907 | 0.0883737 | -3.015497 | 0.2101066 | 0.0636676 | 0.0025656 | 0.0132912 | 0.9702302 |
| 211 | chr21 | 47537232 | 0.4309434 | 0.7627531 | -0.3318097 | 0.1232631 | -2.691883 | 0.0997421 | 0.0830046 | 0.0071050 | 0.0288061 | 0.9702174 |
| 10 | chr21 | 27867918 | 0.4843264 | 0.7809081 | -0.2965817 | 0.1062405 | -2.791606 | 0.1074741 | 0.0532601 | 0.0052447 | 0.0235306 | 0.9679628 |
| 21 | chr21 | 27869534 | 0.5431080 | 0.8619896 | -0.3188816 | 0.1221247 | -2.611114 | 0.1116294 | 0.0393736 | 0.0090248 | 0.0349723 | 0.9637577 |
| 287 | chr22 | 45086364 | 0.4888850 | 0.7693777 | -0.2804927 | 0.1016514 | -2.759358 | 0.0528480 | 0.0458265 | 0.0057915 | 0.0253349 | 0.9621918 |
| 210 | chr21 | 47537208 | 0.4435084 | 0.7425336 | -0.2990252 | 0.1146361 | -2.608474 | 0.7582071 | 0.0729645 | 0.0090947 | 0.0349723 | 0.9589812 |
| 64 | chr21 | 47286962 | 0.6729210 | 0.9111849 | -0.2382639 | 0.0800606 | -2.976043 | 0.0780957 | 0.0443244 | 0.0029199 | 0.0145218 | 0.9579272 |
| 290 | chr22 | 45086516 | 0.6076341 | 0.8390252 | -0.2313910 | 0.0772711 | -2.994535 | 0.0467562 | 0.0475440 | 0.0027486 | 0.0140444 | 0.9554802 |
| 195 | chr21 | 47520470 | 0.3536529 | 0.6412079 | -0.2875550 | 0.1111714 | -2.586592 | 0.0723913 | 0.0412574 | 0.0096930 | 0.0368929 | 0.9544510 |
| 237 | chr21 | 47538637 | 0.5470838 | 0.8449563 | -0.2978725 | 0.1177522 | -2.529654 | 0.1436890 | 0.0405465 | 0.0114175 | 0.0409493 | 0.9539255 |
| 291 | chr22 | 45086539 | 0.6076341 | 0.8390252 | -0.2313910 | 0.0783440 | -2.953526 | 0.0515093 | 0.0495863 | 0.0031417 | 0.0154190 | 0.9532506 |
| 249 | chr21 | 47539196 | 0.6531351 | 0.8994914 | -0.2463563 | 0.0883547 | -2.788265 | 0.1681109 | 0.0860229 | 0.0052991 | 0.0235306 | 0.9512303 |
| 288 | chr22 | 45086376 | 0.4888850 | 0.7693777 | -0.2804927 | 0.1095869 | -2.559545 | 0.0431374 | 0.0476229 | 0.0104809 | 0.0387068 | 0.9504820 |
| 292 | chr22 | 45086541 | 0.6076341 | 0.8390252 | -0.2313910 | 0.0796762 | -2.904141 | 0.0515093 | 0.0405770 | 0.0036826 | 0.0176105 | 0.9504485 |
We can visualize certain regions and/or DMRs using the Gviz package. The plotting function is included in the best-practices.Rmd file.
genome <- "hg19"
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="feb2014.archive.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
i <- 1
candidate_region <- GRanges(paste0(dmrs[i,"chr"], ":", dmrs[i,"start"], "-", dmrs[i,"end"]))
region_plot(m = meth, region=candidate_region, mart=mart, genome=genome, groups=meth$Condition, dmrs=dmrs)## Cache found
## -Subsetting by genomic regions
heatmap_data <- as.data.frame(methrix::get_region_summary(meth, dmrs))
heatmap_data <- heatmap_data[complete.cases(heatmap_data),]
rownames(heatmap_data) <- paste0(heatmap_data$chr,":", heatmap_data$start, "-", heatmap_data$end)
pheatmap::pheatmap(heatmap_data[,-(1:4)], show_rownames = TRUE, annotation_col = as.data.frame(meth@colData),
fontsize = 8)if(!requireNamespace("plotly")) {
install.packages("plotly")
}
if(!requireNamespace("ggplot2")) {
install.packages("ggplot2")
}
if(!requireNamespace("scales")) {
install.packages("scales")
}
library(plotly)
library(ggplot2)
library(scales)count <- c("Higher methylation in tumor" = nrow(dmrs[dmrs$diff.Methy<0,]),
"Higher methylation in control" = nrow(dmrs[dmrs$diff.Methy>0,]))
length <- c("Higher methylation in tumor" = sum(dmrs[dmrs$diff.Methy<0,"length"]),
"Higher methylation in control" = sum(dmrs[dmrs$diff.Methy>0,"length"]))
data <- data.frame(Direction=c("Gain", "Loss"), Count=count, Length=length)
g <- ggplot(data=data)+geom_col(aes(x=factor(1), y=Count, fill=Direction), position = "dodge")+theme_bw()+theme(axis.title.x = element_blank(), axis.text.x = element_blank())+scale_fill_brewer(palette = "Dark2")+ggtitle("Number of differentially methylated regions")+scale_y_continuous(labels = comma)
ggplotly(g) p <- ggplot(data=data)+geom_col(aes(x=factor(1), y=Length, fill=Direction), position = "dodge")+theme_bw()+theme(axis.title.x = element_blank(), axis.text.x = element_blank())+scale_fill_brewer(palette = "Dark2")+ggtitle("Length of differentially methylated regions")+scale_y_continuous(labels = comma)
ggplotly(p)To describe and interpret the differentially methylated regions, we can use the ChIPseeker package. The location of the DMRs can already give us a hint the involved processes, but once we assign the regions to genes, a pathway enrichment analysis is also possible. For additional details, see the vignette of ChIPseeker:
## loading packages
if(!requireNamespace("ChIPseeker")) {
BiocManager::install("ChIPseeker")
}
if(!requireNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene")) {
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
}
if(!requireNamespace("ReactomePA")) {
BiocManager::install("ReactomePA")
}
if(!requireNamespace("clusterProfiler")) {
BiocManager::install("clusterProfiler")
}
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(ReactomePA)
library(clusterProfiler) dmrs <- makeGRangesFromDataFrame(dmrs, keep.extra.columns = T)
peakAnnoList <- lapply(list("Higher in tumor"=dmrs[dmrs$diff.Methy<0,], "Higher in normal" = dmrs[dmrs$diff.Methy>0,]), annotatePeak, TxDb=txdb, tssRegion=c(-3000, 3000), verbose=FALSE)
plotAnnoBar(peakAnnoList)genes <- lapply(list("Higher in tumor"=dmrs[dmrs$diff.Methy<0,], "Higher in normal" = dmrs[1:3,]), seq2gene, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
names(genes) <- gsub(" ", "\n", names(genes))
comppthw <- compareCluster(geneCluster = genes,
fun = "enrichPathway",
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
dotplot(comppthw, font.size = 8)read_bedgraph functionFor memory efficient read in, one can use an HDF5 based methrix object. Only one bedgraph file is in the memory at the same time, while the resulting object won’t be stored in the memory, but on-disk. Additional arguments to use HDF5: - Set h5=TRUE - Set vect=FALSE - h5_dir –> a directory to save the final object. It is possible to save the object later. It increases the processing time significantly.
- h5temp –> a temporary directory to use during data processing. Set this if for example the default temporary location doesn’t have enough free space to store the temporary data.
meth <- methrix::read_bedgraphs(
files = bdg_files,
ref_cpgs = hg19_cpgs,
chr_idx = 1,
start_idx = 2,
M_idx = 3,
U_idx = 4,
stranded = FALSE,
zero_based = FALSE,
collapse_strands = FALSE,
coldata = sample_anno,
vect = FALSE,
h5 = TRUE)## ----------------------------
## -Preset: Custom
## --Missing beta and coverage info. Estimating them from M and U values
## -CpGs raw: 28,700,086
## -CpGs filtered: 28,217,448
## ----------------------------
## -Processing: C1.bedGraph.gz
## --CpGs missing: 28,216,771
## -Processing: C2.bedGraph.gz
## --CpGs missing: 28,216,759
## -Processing: N1.bedGraph.gz
## --CpGs missing: 28,216,746
## -Processing: N2.bedGraph.gz
## --CpGs missing: 28,216,747
## -Finished in: 00:01:10 elapsed (00:01:21 cpu)
All methrix functions work with HDF5-based objects as well, there is no difference in using different functions.
## -Removed 28,216,705 [100%] uncovered loci of 28,217,448 sites
## -Finished in: 39.5s elapsed (42.4s cpu)
It is also possible to transform non-HDF5-based objects to HDF5-based ones and back.
Saving and loading of an HDF5-based object is not possible using the standard save or saveRDS functions. methrix offers easy to use saving and loading tools, which are essentially wrappers around the saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment functions.